
    University of Sheffield TREC-8 Q & A System

    The system entered by the University of Sheffield in the question answering track of TREC-8 is the result of coupling two existing technologies – information retrieval (IR) and information extraction (IE). In essence the approach is this: the IR system treats the question as a query and returns a set of top-ranked documents or passages; the IE system uses NLP techniques to parse the question, analyse the top-ranked documents or passages returned by the IR system, and instantiate a query variable in the semantic representation of the question against the semantic representation of the analysed documents or passages. Thus, while the IE system by no means attempts “full text understanding”, this is a relatively deep approach that works with meaning representations. Since the information retrieval systems we used were not our own (AT&T and UMass) and were used more or less “off the shelf”, this paper concentrates on describing the modifications made to our existing information extraction system to allow it to participate in the Q & A task.
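The coupling described above can be sketched in miniature. Everything here is a toy stand-in, assuming dictionary-shaped semantic representations with a single "?" query variable; it is not the Sheffield system's actual formalism or API.

```python
# Toy sketch of the IR + IE coupling: the IR step has already ranked the
# passages; the IE step unifies the question's semantic representation
# (with one unbound query variable, written "?") against each passage's.

def unify(q_repr, p_repr):
    """Bind the question's "?" slot if every other slot matches."""
    binding = None
    for slot, value in q_repr.items():
        if value == "?":                 # the query variable to instantiate
            binding = p_repr.get(slot)
        elif p_repr.get(slot) != value:  # mismatch: passage cannot answer
            return None
    return binding

def answer_question(q_repr, ranked_passages, top_k=20):
    """Scan the top-ranked passages; return the first successful binding."""
    for p_repr in ranked_passages[:top_k]:
        binding = unify(q_repr, p_repr)
        if binding is not None:
            return binding
    return None
```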

    Evaluating two methods for Treebank grammar compaction

    Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad-coverage grammars. In the simplest case, rules can be ‘read off’ the parse annotations of the corpus, producing either a simple or a probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational cost of parsing under the grammar. In this paper, we explore ways in which a treebank grammar can be reduced in size, or ‘compacted’, using two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
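Technique (i), rule thresholding, can be sketched as follows. The tree encoding and function names are invented for this example and are not taken from the paper.

```python
from collections import Counter

# Illustrative only: read context-free rules off parse trees encoded as
# (label, [children]) tuples, then discard rules seen fewer than
# `threshold` times - the simplest form of treebank grammar compaction.

def rules_from_tree(tree):
    """Yield (parent, (child labels...)) for every internal node."""
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules_from_tree(child)

def compact_grammar(trees, threshold=2):
    """Keep only rules occurring at least `threshold` times."""
    counts = Counter(rule for tree in trees for rule in rules_from_tree(tree))
    return {rule: n for rule, n in counts.items() if n >= threshold}
```

Thresholding alone trades coverage for size; the paper's point is that combining it with rule-parsing recovers most of the lost coverage.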

    Part-of-speech Tagset and Corpus Development for Igbo, an African Language

    This project aims to develop linguistic resources to support computational NLP research on the Igbo language. The starting point is the development of a new part-of-speech tagging scheme based on the EAGLES tagset guidelines, adapted to incorporate additional language-internal features. The tags are currently being used in a part-of-speech annotation task for the development of a POS-tagged Igbo corpus. The proposed tagset has 59 tags.

    Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language

    The accuracy of an annotated corpus can be increased through evaluation and revision of the annotation scheme, and through adjudication of the disagreements found. In this paper, we describe a novel process that has been applied to improve a part-of-speech (POS) tagged corpus for the African language Igbo. An inter-annotation agreement (IAA) exercise was undertaken to iteratively revise the tagset used in the creation of the initial tagged corpus, with the aim of refining the tagset and maximizing annotator performance. The tagset revisions and other corrections were efficiently propagated to the overall corpus in a semi-automated manner using transformation-based learning (TBL) to identify candidates for correction and to propose possible tag corrections. The affected word-tag pairs in the corpus were inspected to ensure a high-quality end product with an accuracy that would not be achieved through a purely automated process. The results show that the tagging accuracy increases from 88% to 94%. The tagged corpus is potentially re-usable for other dialects of the language.
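The TBL step can be illustrated with a deliberately small sketch. This is not the project's pipeline: real TBL scores a candidate rule by corrections minus new errors over rich context templates, whereas this toy version counts only the errors a previous-tag rule would fix.

```python
# Toy transformation-based learning: greedily learn rules of the form
# "retag FROM as TO when the previous tag is PREV", applying each learned
# rule to the corpus before searching for the next one.

def apply_rule(tagged, rule):
    frm, to, prev_required = rule
    out = []
    for i, (word, tag) in enumerate(tagged):
        prev = out[i - 1][1] if i else "START"
        out.append((word, to if tag == frm and prev == prev_required else tag))
    return out

def learn_rules(tagged, gold, max_rules=5):
    tags, rules = list(tagged), []
    for _ in range(max_rules):
        gains = {}
        for i, ((_, tag), (_, gold_tag)) in enumerate(zip(tags, gold)):
            if tag != gold_tag:          # count errors each rule would fix
                prev = tags[i - 1][1] if i else "START"
                key = (tag, gold_tag, prev)
                gains[key] = gains.get(key, 0) + 1
        if not gains:                    # corpus already agrees with gold
            break
        best = max(gains, key=gains.get)
        rules.append(best)
        tags = apply_rule(tags, best)
    return rules, tags
```

The learned rules, not just the corrected corpus, are the useful output: they flag which word-tag pairs to hand over for the manual inspection the paper describes.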

    Added value of bleach sedimentation microscopy for diagnosis of tuberculosis: a cost-effectiveness study.

    SETTING: Bleach sedimentation is a method used to increase the diagnostic yield of sputum microscopy in countries with a high prevalence of human immunodeficiency virus (HIV) infection and limited resources. OBJECTIVES: To compare the relative cost-effectiveness of different microscopy approaches for diagnosing tuberculosis (TB) in Kenya. METHODS: An analytical decision tree model including cost and effectiveness measures of 10 combinations of direct (D) and overnight bleach (B) sedimentation microscopy was constructed. Data were drawn from the evaluation of the bleach sedimentation method on two specimens (first on the spot [1] and second morning [2]) from 644 TB suspects in a peripheral health clinic. Incremental cost per smear-positive case detected was measured. Costs included human resources and materials, using a micro-costing evaluation. RESULTS: All bleach-based microscopy approaches detected significantly more cases (from 23.3% for B1 to 25.9% for B1+B2) than the conventional D1+D2 approach (21.0%). Cost per tested case ranged from euro 2.7 (B1) to euro 4.5 (B1+D2+B2). B1 and B1+B2 were the most cost-effective approaches. D1+B2 and D1+B1 were good alternatives that avoid relying exclusively on bleach sedimentation microscopy. CONCLUSIONS: Among the several effective microscopy approaches assessed, including sodium hypochlorite sedimentation, only some resulted in a limited increase in laboratory workload and would be suitable for programmatic implementation.

    Sub-story detection in Twitter with hierarchical Dirichlet processes

    Social media has now become the de facto information source on real-world events. The challenge, however, given the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time – a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories, and we call the corresponding task of their automatic detection sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn the sub-topics associated with sub-stories, which enables it to handle subtle variations among them. It is compared with state-of-the-art story detection approaches based on locality-sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real-world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall sub-stories with high precision. This results in an improvement of up to 60% in the F-score of the HDP-based sub-story detection approach compared to standard story detection approaches. A similar improvement is seen with an information-theoretic evaluation measure proposed for the sub-story detection task. A further contribution of this paper is to demonstrate that considering the conversational structures within the Twitter stream can bring up to a 200% improvement in sub-story detection performance.
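The appeal of HDP here is that the number of sub-topics need not be fixed in advance. A toy Chinese restaurant process draw (the sampling scheme underlying Dirichlet process mixtures, not the paper's implementation) shows how new sub-topics can keep appearing as posts arrive:

```python
import random

# Chinese restaurant process: post i joins existing sub-topic k with
# probability proportional to that topic's current size, or opens a brand
# new sub-topic with probability proportional to alpha.

def crp_assignments(n_posts, alpha=1.0, seed=7):
    rng = random.Random(seed)
    sizes = []         # number of posts in each sub-topic so far
    assignments = []
    for i in range(n_posts):
        r = rng.uniform(0, i + alpha)
        for k, size in enumerate(sizes):
            if r < size:
                sizes[k] += 1
                assignments.append(k)
                break
            r -= size
        else:          # no existing sub-topic chosen: open a new one
            sizes.append(1)
            assignments.append(len(sizes) - 1)
    return assignments, sizes
```

The number of sub-topics grows slowly with the number of posts rather than being set up front, which is what lets such a model track an unforeseen number of sub-stories.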

    Joining up health and bioinformatics: e-science meets e-health

    CLEF (Co-operative Clinical e-Science Framework) is an MRC-sponsored project in the e-Science programme that aims to establish methodologies and a technical infrastructure for the next generation of integrated clinical and bioscience research. It is developing methods for managing and using pseudonymised repositories of long-term patient histories which can be linked to genetic and genomic information or used to support patient care. CLEF concentrates on removing key barriers to managing such repositories – ethical issues, information capture, integration of disparate sources into coherent ‘chronicles’ of events, user-oriented mechanisms for querying and displaying the information, and compiling the required knowledge resources. This paper describes the overall information flow and technical approach designed to meet these aims within a Grid framework.

    Automatic Label Generation for News Comment Clusters

    We present a supervised approach to automatically labelling topic clusters of reader comments to online news. We use a feature set that includes both features capturing properties local to the cluster and features that capture aspects from the news article and from comments outside the cluster. We evaluate the approach in an automatic and a manual, task-based setting. Both evaluations show the approach to outperform a baseline method, which uses tf*idf to select comment-internal terms for use as topic labels. We illustrate how cluster labels can be used to generate cluster summaries and present two alternative summary formats: a pie chart summary and an abstractive summary.
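The tf*idf baseline can be sketched as follows; the data layout and function name are invented for this example, not taken from the paper.

```python
import math
from collections import Counter

# Sketch of a tf*idf labelling baseline: score each term by its frequency
# inside the cluster, discounted by how many clusters contain it, and use
# the top-scoring terms as the cluster's topic label.

def tfidf_labels(clusters, cluster_id, n_labels=3):
    """clusters maps cluster id -> list of tokenised comments."""
    tf = Counter(tok for comment in clusters[cluster_id] for tok in comment)
    n_clusters = len(clusters)

    def idf(tok):
        df = sum(any(tok in comment for comment in comments)
                 for comments in clusters.values())
        return math.log(n_clusters / df)

    return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)[:n_labels]
```

Because the baseline can only pick comment-internal terms, it misses label words that occur in the article but not the comments, which is the gap the paper's supervised feature set targets.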

    The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News

    Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in on-line news. To date, however, there has been little discussion of what these summaries should be like, and a lack of human-authored exemplars, quite likely because writing summaries of this kind of interchange is so difficult. In this paper we propose one type of reader comment summary – the conversation overview summary – that aims to capture the key argumentative content of a reader comment conversation. We describe a method we have developed to support humans in authoring conversation overview summaries and present a publicly available corpus – the first of its kind – of news articles plus comment sets, each multiply annotated, according to our method, with conversation overview summaries.